Google Malaria and you'll see this:

A first search in Google classifies Malaria as a rare disease

Malaria is considered a rare disease.
But read carefully, and you'll notice the catch: malaria is a rare disease in the U.S.

Reality is very different if you look beyond America's borders. In fact, in low-income countries, malaria is among the top 10 leading causes of death:

In 2016, malaria ranked 6th among the leading causes of death in low-income countries

To get a better grasp of what's going on, let's do our own analysis of malaria. We'll use data from the MalariaAtlas project, made available in this GitHub repo.

In [1]:
#Import packages
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import numpy as np

#Define plotting backend
pd.options.plotting.backend = 'plotly'

#Define data location
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths.csv'
In [2]:
#Load data
data = pd.read_csv(url)

#Rename columns
data.columns = ['Entity','Code','Year','Deaths_per_100k']

#The data contains three kinds of entities:
# 1. countries (e.g. Nigeria), 
# 2. regions (e.g. Western Sub-Saharan Africa)
# 3. the World.

#Since it is countries, not regions, that have the political autonomy to invest in health systems,
#or really do anything about malaria, we'll focus our analysis on countries.
#We're also interested in global values, so we'll keep the world as well.

#Our dataset has codes for Countries and the World, but not for Regions.
#Filter countries and the world.
data = data[pd.notnull(data.Code)]


#Create a new variable:
#how much has the mortality rate increased (+) or decreased (-) since the start of our series (1990)
#---------------------------------------------------------------------------------------------------
#Get mortality level from start of series
data = pd.merge(data, data[data.Year == min(data.Year)], how = 'left', on = 'Code')
#Define variable
data['Mortality_rate_percent_change'] = (((data['Deaths_per_100k_x'] + 1e-9)/(data['Deaths_per_100k_y'] + 1e-9)) - 1) * 100
#Drop columns replicated by merging process
data = data.drop(columns = ['Entity_y','Year_y','Deaths_per_100k_y'])
#Rename columns
data.columns = ['Entity','Code','Year','Deaths_per_100k','Mortality_rate_percent_change']
#---------------------------------------------------------------------------------------------------

#Set world and individual countries to have different colors in future plots
data['color'] = 'Country'
data.loc[data.Entity == 'World','color'] = 'World'
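
The percent-change variable above compares every year to each entity's first year. As a rough sketch of the same idea without a self-merge, the baseline can be broadcast with `groupby(...).transform('first')`. The `toy` frame and its numbers are made up for illustration, and this sketch skips the `1e-9` guard against zero baselines used above.

```python
import pandas as pd

# Toy data (made-up numbers, for illustration only)
toy = pd.DataFrame({
    'Code': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'Year': [1990, 1991, 1992, 1990, 1991, 1992],
    'Deaths_per_100k': [100.0, 110.0, 80.0, 20.0, 10.0, 5.0],
})

# Sort so that the first row of each group is its earliest year
toy = toy.sort_values(['Code', 'Year'])

# Broadcast each entity's first-year rate to all of its rows
baseline = toy.groupby('Code')['Deaths_per_100k'].transform('first')

# Percent change relative to the baseline year
toy['Mortality_rate_percent_change'] = (toy['Deaths_per_100k'] / baseline - 1) * 100

print(toy)
```

This keeps the original column names intact, so no dropping or renaming of `_x`/`_y` columns is needed afterwards.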

The good news is that the world has been reducing deaths from malaria:

In [3]:
fig = data[data.Entity == 'World'].plot('Year',
                                        'Deaths_per_100k',
                                        color = 'color',
                                        color_discrete_map = {'World': 'blue'},
                                        hover_data = {'Entity': False,
                                                      'Year': True,
                                                      'Deaths_per_100k': ':.2f',
                                                      'Mortality_rate_percent_change': ':.2f'},
                                        title = 'Malaria mortality only started decreasing globally after 2003')
fig.update_xaxes(title_text='Year',
                tickvals = list(range(1990,2017, 5)))
fig.update_yaxes(title_text='Deaths by malaria per 100k')

                 
fig.show()

However, up to 2003, malaria was rising steadily.
Only after 2003 did the world start to defeat the disease.
That's fairly recent, don't you think?

Look at it another way: it took the world until 2011 to get malaria mortality back to its 1990 level.
That's 21 years!

Well, better late than never!

But let's not be too hasty... The bad news is that a global statistic doesn't really say much. Global statistics may be hiding one of the major global problems: inequality. When this happens, global aggregates cease to be informative. So let's check whether the global statistic is representative of what happens in individual countries:

In [4]:
fig = px.line(
    data,
    title= "The global mortality rate says very little about each country's reality",
    x='Year',
    y='Deaths_per_100k',
    hover_data = {'Entity':True,
                  'Year': True,
                  'color': False,
                  'Deaths_per_100k': ':.0f',
                  'Mortality_rate_percent_change': ':.0f'},
    color='color',
    color_discrete_map = {"Country": "lightgrey" , "World":"blue"},
    #symbol = 'color',
    width = 800,
    height =600
)

fig.update_xaxes(title_text='Year',
                tickvals = list(range(1990,2017, 5)))
fig.update_yaxes(title_text='Deaths by malaria per 100k')
fig.show()

What a mess!
It's not like all countries follow more or less what happens on the global level.
Rather, it seems there are many countries whose mortality is far higher than the global level, and they follow quite chaotic and heterogeneous paths!

Where are these countries?

In [5]:
#To answer these questions, we'll need to bring in geographical data:

#Load world shapefile
world = gpd.read_file("C:\\Users\\Felipe\\Anaconda3\\lib\\site-packages\\geopandas\\datasets\\naturalearth_lowres\\naturalearth_lowres.shp")
world = world.loc[:,['continent','name','iso_a3','geometry']]

#Merge to our dataset
geodata = pd.merge(world, data, how = 'inner', left_on = 'iso_a3', right_on = 'Code')

#Make map
geodata[geodata.Year == max(geodata.Year)].plot(cmap = 'OrRd',column = 'Deaths_per_100k', figsize = (10,10))
plt.title('The countries with the highest mortality rates are mostly in Africa')
plt.show()
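
An inner merge silently drops every row whose code has no match on the other side. A minimal sketch of how one might check what gets lost, using an outer merge with `indicator=True`. The two small frames are made-up stand-ins for the shapefile attributes and the malaria table, and the codes `XXX` and `ZZZ` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the shapefile attributes and the malaria table (made-up rows)
shapes = pd.DataFrame({'iso_a3': ['NGA', 'GHA', 'XXX'],
                       'name': ['Nigeria', 'Ghana', 'Unmatched shape']})
deaths = pd.DataFrame({'Code': ['NGA', 'GHA', 'ZZZ'],
                       'Entity': ['Nigeria', 'Ghana', 'Unmatched entity']})

# An outer merge with indicator=True labels each row by where it came from
check = pd.merge(shapes, deaths, how='outer',
                 left_on='iso_a3', right_on='Code', indicator=True)

# Rows present in only one table are the ones an inner merge would drop
dropped = check[check['_merge'] != 'both']
print(dropped[['iso_a3', 'Code', '_merge']])
```

Running a check like this before mapping helps tell a country with no malaria data apart from one that simply failed to match.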

Malaria is clearly a much bigger problem in Africa than anywhere else in the world!
Indeed, on the map it looks almost confined to Africa. That's why our first Google search deemed it a rare disease in the U.S.

If we look at the distribution of malaria by continent, we can see that malaria is not, in fact, confined to Africa. But it does strike Africa much harder than it does any other continent.
But there's good news here: malaria has been declining significantly, even in Africa.
Take a look at what happens to Africa:

In [6]:
fig = px.bar((geodata.
              loc[:,['continent','Year','Deaths_per_100k']].
              groupby(['continent','Year'], as_index=False).mean().
              sort_values(['Year','Deaths_per_100k'], ascending=[True,False])), 
              x='continent', 
              y='Deaths_per_100k',
              animation_frame = 'Year',
              hover_data={'continent': True, 
                          'Year': True, 
                          'Deaths_per_100k': ':.2f'},
              range_y = [0,80],
            title = 'Average of country mortality rate by continent')

fig.update_yaxes(title_text='Deaths by malaria per 100k')

fig.show()
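
Note that the chart averages country rates, so a small country counts as much as a large one. The sketch below contrasts that unweighted mean with a population-weighted one. The `population` column is hypothetical (the malaria dataset used here does not include it) and all numbers are made up.

```python
import pandas as pd

# Toy data: two continents, two countries each (made-up numbers)
toy = pd.DataFrame({
    'continent': ['X', 'X', 'Y', 'Y'],
    'Deaths_per_100k': [100.0, 10.0, 2.0, 4.0],
    'population': [1_000_000, 9_000_000, 5_000_000, 5_000_000],  # hypothetical column
})

# Unweighted: every country counts the same, as in the bar chart above
unweighted = toy.groupby('continent')['Deaths_per_100k'].mean()

# Population-weighted: sum(rate * population) / sum(population)
toy['rate_times_pop'] = toy['Deaths_per_100k'] * toy['population']
sums = toy.groupby('continent')[['rate_times_pop', 'population']].sum()
weighted = sums['rate_times_pop'] / sums['population']

print(pd.DataFrame({'unweighted': unweighted, 'weighted': weighted}))
```

The two averages answer different questions: the unweighted one describes the typical country, while the weighted one describes the typical person.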

We see deaths in Africa increasing and then, suddenly, changing direction and decreasing significantly.

This decrease, however, has not been uniform across countries. We can look at each country individually in the heatmap below. Note how most of the dark lines get lighter over time. This means these countries have made progress in combating malaria.

In [7]:
#Most countries managed to decrease significantly the mortality rate
#e.g. Burundi; Malawi, Mozambique
plt.figure(figsize=(10,10))
heatmap_data = data.pivot(index="Entity", columns="Year", values="Deaths_per_100k").sort_index(axis = 1)

ax = sns.heatmap(heatmap_data, cmap = "OrRd")
plt.title("Heatmap Malaria Data")
plt.show()

Yet, for most of those countries, malaria isn't really an issue. So how do we identify countries where malaria really is an issue?

If a country has one of the 20 highest mortality rates in a year, it's fair to say it has an issue with malaria. So let's consider all countries that have been on this list for at least a year.
In other words, we'll look at all countries that have had one of the 20 highest mortality rates for at least one year during the 1990-2016 period.

Let's build that list by collecting, for each year, the countries with the highest mortality rates:

In [8]:
N = 20

#Which countries have been among the top N mortality rates for at least one year from 1990 to 2016
set_of_countries = set()
for year in range(1990, 2017): #range() excludes its upper bound, so 2017 is needed to include 2016
    top_n = data[data.Year == year].sort_values('Deaths_per_100k', ascending=False).head(N)
    set_of_countries = set_of_countries.union(set(top_n.Entity))
    
top = geodata[geodata.Entity.isin(set_of_countries)]

print('These are the countries:')
print(set_of_countries)
These are the countries:
{'Mali', 'Solomon Islands', 'Cameroon', 'Guinea', 'Burundi', 'Mozambique', 'Equatorial Guinea', 'Rwanda', 'Senegal', 'Sierra Leone', 'Burkina Faso', 'Democratic Republic of Congo', 'Ghana', 'Congo', 'Nigeria', 'Central African Republic', 'Benin', "Cote d'Ivoire", 'Guinea-Bissau', 'Gabon', 'Uganda', 'Togo', 'Malawi', 'Niger', 'Tanzania', 'Liberia'}
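
The same "ever in the top N" set can be built without an explicit loop by ranking countries within each year. This is only a sketch on made-up data: the `toy` frame and the name `ever_top_n` are not part of the analysis above.

```python
import pandas as pd

# Toy data: four entities over two years (made-up numbers)
toy = pd.DataFrame({
    'Entity': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
    'Year': [1990, 1990, 1990, 1990, 1991, 1991, 1991, 1991],
    'Deaths_per_100k': [50.0, 30.0, 10.0, 1.0, 5.0, 40.0, 45.0, 2.0],
})
N = 2

# Rank entities within each year (1 = highest rate); method='first' breaks ties
# by order of appearance, so exactly N entities are kept per year
rank = toy.groupby('Year')['Deaths_per_100k'].rank(method='first', ascending=False)

# Keep every entity that made the top N in at least one year
ever_top_n = set(toy.loc[rank <= N, 'Entity'])
print(ever_top_n)
```

Because the ranking covers whatever years are present in the data, there is no hard-coded year range to keep in sync.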

How did malaria mortality evolve in these countries?

The heatmap below suggests it really depends on the country. Burkina Faso doesn't seem to have made much progress at all. Burundi, however, seems to have reduced malaria mortality substantially!

In [9]:
#Most countries managed to decrease significantly the mortality rate
#e.g. Burundi; Malawi, Mozambique
plt.figure(figsize=(10,10))
heatmap_data = top.pivot(index="Entity", columns="Year", values="Deaths_per_100k").sort_index(axis = 1)

ax = sns.heatmap(heatmap_data, cmap = "OrRd")
plt.title("Heatmap Malaria Data")
plt.show()

But let's be fair: even among the most harshly afflicted countries, some may still be doing much better or worse than others. So let's compare how much each has improved relative to where it began, in 1990:

In [10]:
fig = px.line(top,
             x='Year',
             y='Mortality_rate_percent_change',
             color = 'Entity',
             hover_data={'Entity': True, 
                          'Year': True, 
                          'Deaths_per_100k': ':.2f',
                          'Mortality_rate_percent_change': ':.2f'}
            )
fig.show()

While most countries experienced an overall decrease in mortality, not all did: Cameroon saw a galloping increase of nearly 22%, and Equatorial Guinea did not fare much better.

In [11]:
(top.
 loc[(top.Year == 2016) & (top.Mortality_rate_percent_change > 0),['name','Mortality_rate_percent_change']].
 sort_values('Mortality_rate_percent_change', ascending=False)
)
Out[11]:
      name                  Mortality_rate_percent_change
1430  Cameroon              21.869753
1754  Eq. Guinea            17.536224
1349  Benin                 10.697466
1673  Central African Rep.   4.925722
1700  Congo                  4.227996
1538  Guinea                 3.999867

How much does malaria affect the countries we're currently studying?
It depends on what we mean by "affect". Affect what?
Let's look at affecting life expectancy.
Warning: this will be a very informal, not-at-all-statistically-rigorous analysis. But it will be fun! So let's do it!

There are many reasons why people die. Malaria does play a role... but how big is this role?
Among the countries in our sample, some of them have had major problems besides malaria.
Sierra Leone, for instance, was in civil war until 2002.

We'll need data on life expectancy. We can get that from the gapminder dataset. Unfortunately, some years will be dropped, as we don't have life expectancy for every year.

In [12]:
from gapminder import gapminder
top = pd.merge(top, gapminder, how = 'inner', left_on = ['Entity','Year'], right_on = ['country','year'], validate = '1:1')

Countries with higher malaria mortality rates also tend to have lower life expectancies.
A steeper regression line means that each additional malaria death per 100k goes along with a larger drop in life expectancy.
It is therefore interesting to observe that the regression line becomes more horizontal over time, suggesting that malaria accounts for a smaller and smaller share of what lowers life expectancy.
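
One way to check the flattening numerically, rather than by eye, is to fit a line per year and compare slopes. The sketch below does that on made-up data. It also reports the correlation, because slope and strength of association are different things: in this toy example the slope flattens while the points stay perfectly aligned.

```python
import numpy as np
import pandas as pd

# Toy data for two years (made-up numbers, perfectly linear on purpose)
toy = pd.DataFrame({
    'Year': [1992] * 4 + [2007] * 4,
    'Deaths_per_100k': [50.0, 100.0, 150.0, 200.0] * 2,
    'lifeExp': [55.0, 50.0, 45.0, 40.0, 56.0, 55.0, 54.0, 53.0],
})

rows = {}
for year, group in toy.groupby('Year'):
    # Degree-1 polynomial fit = simple linear regression; first coefficient is the slope
    slope, intercept = np.polyfit(group['Deaths_per_100k'], group['lifeExp'], deg=1)
    # Pearson correlation measures how tightly the points follow a line
    corr = group['Deaths_per_100k'].corr(group['lifeExp'])
    rows[year] = {'slope': slope, 'corr': corr}

summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary)
```

On real data both numbers would move, so it is worth looking at them side by side before reading too much into the slope alone.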

In [13]:
#Highlight Rwanda (the reason is explained below)
top['color'] = 'Other high-mortality country'
top.loc[top.Entity == 'Rwanda','color'] = 'Rwanda'

#Make graph
fig = px.scatter(top,
                 x = 'Deaths_per_100k',
                 y = 'lifeExp',
                 title = 'Malaria is becoming less important in explaining life expectancy',
                 animation_frame='Year',
                 hover_data={'color': False,
                             'Entity': True,
                             'Year': True,
                             'Deaths_per_100k': ':.2f',
                             'Mortality_rate_percent_change': ':.2f'},
                 color='color',
                 color_discrete_map = {"Rwanda": "navy" , "Other high-mortality country":"black"},
                 trendline='ols',
                 range_x = [0, 250],
                 range_y = [20, 65]
                )
fig.show()

Note the blue dot.
That's Rwanda. Rwanda starts as a clear outlier in the graph: its life expectancy is well below average.
That's because Rwanda was the unfortunate stage of a series of violent events that culminated in a horribly bloody massacre. Watch how its life expectancy dramatically increases, and how the number of malaria casualties decreases.

Looking at all the points, we are tempted to make two hypotheses:

  1. When our series begins, malaria seems correlated with life expectancy, suggesting it is an important factor in determining how long people live in these countries;
  2. When our series ends, malaria no longer seems correlated with life expectancy, suggesting it has grown weaker and other factors have become more important in explaining life expectancy.

To make this insight a little more palpable, let's explain life expectancy by the number of deaths due to malaria, while controlling for GDP per capita and population. We'll measure malaria's contribution as the share of the variance left unexplained by the control variables that disappears once malaria is added to the regression, i.e. (SSE1 - SSE2)/SSE1, where SSE1 is the sum of squared errors of the controls-only model and SSE2 that of the full model (this is known as the partial R²). We'll do this for every year and see how it evolves:

In [14]:
from sklearn.linear_model import LinearRegression

variance_in_lifeExp_explained_by_Malaria = {}

for year in sorted(set(top.Year)):
    X_1 = top.loc[top.Year == year, ['pop','gdpPercap']].to_numpy()
    X_2 = top.loc[top.Year == year, ['pop','gdpPercap','Deaths_per_100k']].to_numpy()
    y = top.loc[top.Year == year, 'lifeExp'].to_numpy()

    #Sum of squared errors of each model (np.sum returns a scalar)
    SSE1 = np.sum((y - LinearRegression().fit(X_1,y).predict(X_1))**2) #reduced model
    SSE2 = np.sum((y - LinearRegression().fit(X_2,y).predict(X_2))**2) #full model

    #Share of the reduced model's unexplained variance that malaria accounts for
    variance_in_lifeExp_explained_by_Malaria[year] = float((SSE1 - SSE2)/SSE1)
    
    print(f'In {year}, malaria explains {100 * variance_in_lifeExp_explained_by_Malaria[year]:.0f}% of why some countries had higher life expectancies than others')
    
In 1992, malaria explains 3% of why some countries had higher life expectancies than others
In 1997, malaria explains 18% of why some countries had higher life expectancies than others
In 2002, malaria explains 8% of why some countries had higher life expectancies than others
In 2007, malaria explains 1% of why some countries had higher life expectancies than others
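
The quantity computed in the loop, (SSE1 - SSE2)/SSE1, can also be written as a small reusable function. The sketch below uses plain NumPy least squares on synthetic data. The helper names `sse` and `partial_r2` are made up for this example, and the data is random, so the resulting value says nothing about malaria.

```python
import numpy as np

def sse(X, y):
    """Sum of squared residuals of an ordinary least squares fit with intercept."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def partial_r2(X_controls, x_extra, y):
    """Share of the controls-only model's unexplained variance removed by x_extra."""
    sse_reduced = sse(X_controls, y)
    sse_full = sse(np.column_stack([X_controls, x_extra]), y)
    return (sse_reduced - sse_full) / sse_reduced

# Synthetic data: y depends on two controls and strongly on one extra variable
rng = np.random.default_rng(0)
n = 200
controls = rng.normal(size=(n, 2))
extra = rng.normal(size=n)
y = controls @ np.array([1.0, -0.5]) + 2.0 * extra + rng.normal(size=n)

value = partial_r2(controls, extra, y)
print(f'partial R2 of the extra variable: {value:.2f}')
```

Keep in mind that with only a couple of dozen countries per year, as in the analysis above, such estimates are noisy and can swing a lot from one year to the next.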

These numbers call our attention to the heterogeneity that exists even among the most afflicted countries. In 1997, malaria accounted for an astonishing 18% of the variability in life expectancy that GDP per capita and population could not explain.
That's outrageous! And we're only talking about countries that were all in very bad shape as far as malaria was concerned.
But it gets worse...

Sao Tome and Principe is not in our sample, but deserves to be mentioned. Through a series of public policies initiated in 2005, it managed to reduce malaria mortality by 95% in just two years, and was aiming to eradicate the disease altogether by 2020... were it not for the COVID-19 pandemic.

But COVID-19 is real -- and some experts believe it may double malaria's casualties this year in Africa.